Fine-Tune ViT for Vehicle Image Classification with Hugging Face Transformers 🤗¶

Problem Statement¶

Fine-tune the Hugging Face Vision Transformer (ViT) with PyTorch for vehicle-type image classification, comparing gradual unfreezing of layers starting from the head against training a model with all layers unfrozen from the start.

Downloading the dataset from Roboflow¶

We'll be downloading the image dataset from Roboflow - https://universe.roboflow.com/paul-guerrie-tang1/vehicle-classification-eapcd

The dataset will be saved in the following structure:

Dataset structure.png
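In case the image above does not render, a small helper can list the splits and class folders directly. This is a sketch assuming the default Roboflow "folder" export layout (train/valid/test, one subfolder per class); `list_classes` is a hypothetical helper, not part of the notebook's pipeline:

```python
from pathlib import Path

def list_classes(root, splits=("train", "valid", "test")):
    """Return {split: sorted class-folder names} for a Roboflow folder export."""
    root = Path(root)
    return {
        s: sorted(p.name for p in (root / s).iterdir() if p.is_dir())
        for s in splits
        if (root / s).is_dir()
    }

# e.g. list_classes("Vehicle-Classification-1") should map each split
# to its 17 class-folder names (Ambulance, Barge, Bicycle, ...)
```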
In [ ]:
# Downloading the dataset
from roboflow import Roboflow
import os
if not os.path.exists("./Vehicle-Classification-1/"): 
    rf = Roboflow(api_key="QcgQPG8g2tj3r0ottt5l")
    project = rf.workspace("paul-guerrie-tang1").project("vehicle-classification-eapcd")
    dataset = project.version(1).download("folder")

Importing all the required packages¶

In [ ]:
import torch
from torchvision.datasets import ImageFolder
import torchvision.transforms as transforms
from pathlib import Path
import matplotlib.pyplot as plt
from PIL import Image
from torch.utils.data import DataLoader
from transformers import ViTFeatureExtractor, ViTForImageClassification, ViTImageProcessor
import tqdm as notebook_tqdm
import pytorch_lightning as pl
from torchmetrics import Accuracy
import numpy as np

import warnings
warnings.filterwarnings('ignore')
In [ ]:
# Assigning device based on available hardware (CUDA, Apple MPS, or CPU)
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print("Running on", device)
Running on cuda

Loading the data¶

In [ ]:
# Creating a directory path to our dataset
data_dir = Path('Vehicle-Classification-1')
train_data_dir = Path('Vehicle-Classification-1/train')
val_data_dir = Path('Vehicle-Classification-1/valid')
test_data_dir = Path('Vehicle-Classification-1/test')
ds = ImageFolder(data_dir)

# Assigning Train, Test and Valid images to their respective set
train_ds = ImageFolder(train_data_dir)
val_ds = ImageFolder(val_data_dir)
test_ds = ImageFolder(test_data_dir)

Showing some examples for each category in vehicle dataset¶

In [ ]:
import os
test_path = './Vehicle-Classification-1/test/'
plt.figure(figsize=(60, 50))
classes = os.listdir(test_path)[:-1]
for i, class_folder in enumerate(classes):
    image_name = os.listdir(os.path.join(test_path, class_folder))[0]
    
    plt.subplot(4, len(train_ds.classes)//4, i+1)
    ax = plt.gca()
    ax.set_title(
        class_folder,
        size='xx-large',
        pad=2,
        loc='left',
        y=0,
        backgroundcolor='white'
    )
    ax.axis('off')
    image = Image.open(os.path.join(test_path, class_folder, image_name))
    plt.imshow(image)
    plt.axis('off')

Preparing Labels for Our Model's Config¶

In [ ]:
# Creating label dictionaries for our model configurations
label2id = {}
id2label = {}

for i, class_name in enumerate(train_ds.classes):
    label2id[class_name] = str(i)
    id2label[str(i)] = class_name
In [ ]:
label2id
Out[ ]:
{'Ambulance': '0',
 'Barge': '1',
 'Bicycle': '2',
 'Boat': '3',
 'Bus': '4',
 'Car': '5',
 'Cart': '6',
 'Caterpillar': '7',
 'Helicopter': '8',
 'Limousine': '9',
 'Motorcycle': '10',
 'Segway': '11',
 'Snowmobile': '12',
 'Tank': '13',
 'Taxi': '14',
 'Truck': '15',
 'Van': '16'}
In [ ]:
id2label
Out[ ]:
{'0': 'Ambulance',
 '1': 'Barge',
 '2': 'Bicycle',
 '3': 'Boat',
 '4': 'Bus',
 '5': 'Car',
 '6': 'Cart',
 '7': 'Caterpillar',
 '8': 'Helicopter',
 '9': 'Limousine',
 '10': 'Motorcycle',
 '11': 'Segway',
 '12': 'Snowmobile',
 '13': 'Tank',
 '14': 'Taxi',
 '15': 'Truck',
 '16': 'Van'}

Image Classification Collator¶

For preprocessing we will create a custom image classification collator to collate batches.
The collator runs the images (x[0] for each x in the batch) through the feature_extractor, which returns them as PyTorch tensors, and sets encodings['labels'] to the class indices (x[1] for each x in the batch) as a torch.long tensor.

In [ ]:
# Creating custom image classification collator function
class ImageClassificationCollator:
    def __init__(self, feature_extractor):
        self.feature_extractor = feature_extractor
 
    def __call__(self, batch):
        encodings = self.feature_extractor([x[0] for x in batch], return_tensors='pt')
        encodings['labels'] = torch.tensor([x[1] for x in batch], dtype=torch.long)
        encodings['labels'] = encodings['labels'].to(device)
        return encodings

Initializing Feature Extractor and Data Loaders¶

Using the Hugging Face ViTFeatureExtractor, we will load the pretrained preprocessing configuration of the 'google/vit-base-patch16-224-in21k' model and use it to prepare images inside our custom collator.
The collator instance is passed as the collate_fn parameter of the PyTorch DataLoader.
The DataLoader parallelizes data loading and automatically builds batches from the dataset.

In [ ]:
# Initializing Feature Extractor and Data Loaders
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')

collator = ImageClassificationCollator(feature_extractor)

train_loader = DataLoader(train_ds, batch_size=256, collate_fn=collator, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256, collate_fn=collator, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=8, collate_fn=collator, shuffle=True)
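The loaders above use the DataLoader defaults, which load data in the main process. As a hedged aside (the parameter values below are illustrative, not what this notebook uses, and the dataset is a dummy stand-in), `num_workers` is the knob that actually parallelizes loading:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative only: a tiny in-memory dataset standing in for train_ds.
# num_workers spawns subprocesses that prepare the next batches while the
# main process trains; pin_memory speeds host-to-GPU copies under CUDA.
dummy_ds = TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 17, (32,)))
loader = DataLoader(
    dummy_ds,
    batch_size=8,
    shuffle=True,
    num_workers=2,     # parallel data-loading workers (0 = load in main process)
    pin_memory=False,  # set True when training on a CUDA device
)
batches = list(loader)
print(len(batches))  # 32 samples / batch_size 8 = 4 batches
```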

Creating Classifier class to fine-tune the model¶

Now we will create our Classifier class to fine-tune the model. The Classifier takes two arguments: the model and the learning rate.
This class implements the training and validation steps, as well as the configuration of the Adam optimizer.

In [ ]:
class Classifier(pl.LightningModule):

    def __init__(self, model, lr: float = 2e-5, **kwargs):
        super().__init__()
        self.save_hyperparameters('lr', *list(kwargs))
        self.model = model
        self.forward = self.model.forward
        self.val_acc = Accuracy(
            task='multiclass' if model.config.num_labels > 2 else 'binary',
            num_classes=model.config.num_labels
        )

    def training_step(self, batch, batch_idx):
        outputs = self(**batch)
        self.log(f"train_loss", outputs.loss)
        return outputs.loss

    def validation_step(self, batch, batch_idx):
        outputs = self(**batch)
        self.log(f"val_loss", outputs.loss)
        acc = self.val_acc(outputs.logits.argmax(1), batch['labels'])
        self.log(f"val_acc", acc, prog_bar=True)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

Building, Training and Evaluating the Models¶

We then load the ViTForImageClassification pretrained model into our variable model.

Model 1.1: Freezing the body layers and training the head layers
In this step we also check whether a model trained this way is already present; if so, we load it instead.

In [ ]:
# Building or Loading the model
if os.path.exists('./models/head_trained.pt'):
    if torch.backends.mps.is_available():
        model = torch.load('./models/head_trained.pt', map_location ='mps')
    elif torch.cuda.is_available():
        model = torch.load('./models/head_trained.pt')
    else:
        model = torch.load('./models/head_trained.pt', map_location ='cpu')
else:
    model = ViTForImageClassification.from_pretrained(
        'google/vit-base-patch16-224-in21k',
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label
    )
# Freezing main body layers
for p in model.vit.parameters():
    p.requires_grad = False

model = model.to(device)

Before we train our model, we first set the seed for the pseudo-random number generators to 42.
This lets us obtain consistent results when rerunning. We pass two arguments to our Classifier: the model created earlier and the learning rate (we will use 2e-5).
Finally, to wrap it all up, we pass our classifier, train_loader, and val_loader into trainer.fit.
Once again, if the training has already been performed we avoid running it a second time.

In [ ]:
# Training the model
if not os.path.exists('./models/head_trained.pt'):
    pl.seed_everything(42)

    classifier = Classifier(model, lr=2e-5)

    if torch.cuda.is_available():
        trainer = pl.Trainer(accelerator='cuda', devices=1, precision='bf16-mixed', max_epochs=50)
    elif torch.backends.mps.is_available():
        trainer = pl.Trainer(accelerator='mps', devices=1, precision='bf16-mixed', max_epochs=50)
    else:
        trainer = pl.Trainer(accelerator='cpu', devices=1, precision='bf16-mixed', max_epochs=50)

    trainer.fit(classifier, train_loader, val_loader)
    torch.save(model,'./models/head_trained.pt')

Once the training is done, we create a check_accuracy function to calculate the test accuracy of our model.

In [ ]:
# Creating a function to calculate the accuracy of the model
def check_accuracy(test_loader: DataLoader, model, device):
    num_correct = 0
    total = 0
    model.eval()
    model.to(device)
    with torch.no_grad():
        for batch in test_loader:
            
            data = batch['pixel_values'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(data)
            predictions = outputs.logits.softmax(1).argmax(1)

            num_correct += (predictions == labels).sum()
            total += labels.size(0)

        print(f"Test Accuracy of the model: {float(num_correct)/float(total)*100:.2f}")
In [ ]:
# Evaluating the model
check_accuracy(test_loader, model, device)
Test Accuracy of the model: 90.21

As we can see, the model reached a good level of accuracy, as confirmed by the validation loss and accuracy graphs plotted by TensorBoard for this run.

Run data for frozen model with open head

Model 1.2: Training the last layers
After training the head we proceed to unfreeze a few more layers.

In [ ]:
# Unfreezing the final LayerNorm and the LayerNorm inside each encoder block
for p in model.vit.layernorm.parameters():
    p.requires_grad = True
for layer in model.vit.encoder.layer:
    for p in layer.layernorm_after.parameters():
        p.requires_grad = True

Once again we proceed with the training, if a model trained this way is not already present.

In [ ]:
# Training the model
if os.path.exists('./models/last_layers_trained.pt'):
    if torch.backends.mps.is_available():
        model = torch.load('./models/last_layers_trained.pt', map_location ='mps')
    elif torch.cuda.is_available():
        model = torch.load('./models/last_layers_trained.pt')
    else:
        model = torch.load('./models/last_layers_trained.pt', map_location='cpu')
else:
    model.train()

    train_loader = DataLoader(train_ds, batch_size=8, collate_fn=collator, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=8, collate_fn=collator, shuffle=True)

    pl.seed_everything(42)

    classifier = Classifier(model, lr=2e-5)
    
    if torch.cuda.is_available():
        trainer = pl.Trainer(accelerator='cuda', devices=1, precision='bf16-mixed', max_epochs=10)
    elif torch.backends.mps.is_available():
        trainer = pl.Trainer(accelerator='mps', devices=1, precision='bf16-mixed', max_epochs=10)
    else:
        trainer = pl.Trainer(accelerator='cpu', devices=1, precision='bf16-mixed', max_epochs=10)
    
    trainer.fit(classifier, train_loader, val_loader)
    torch.save(model,'./models/last_layers_trained.pt')
In [ ]:
# Evaluating the model
check_accuracy(test_loader, model, device)
Test Accuracy of the model: 94.13

We notice an improvement in accuracy, which is again verifiable from the TensorBoard plots.

Run data for frozen model with open head

We can see that the current run (in pink) showed a less steep improvement, and while this training lasted only 10 epochs, compared with the 50 of the previous run, each epoch required more time. This was expected, as more trainable parameters make backpropagation take longer.
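To see why unfreezing slows the epochs down, one can count trainable versus total parameters before and after opening layers. A minimal sketch (the toy model below is an illustrative stand-in, not the ViT):

```python
import torch.nn as nn

def count_params(model):
    """Return (trainable, total) parameter counts for a torch module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy illustration with a frozen first layer (stand-in for the frozen ViT body):
toy = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
for p in toy[0].parameters():
    p.requires_grad = False
print(count_params(toy))  # (22, 132): only the 10*2 + 2 head params are trainable
```

Calling `count_params(model)` after each unfreezing step would show the trainable count growing, matching the longer epoch times observed above.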

Model 1.3: Training with all the layers
We now proceed to unfreeze all layers and allow the entire model to be trained.

In [ ]:
for p in model.parameters():
    p.requires_grad = True

As usual we avoid computationally expensive training if it's not necessary.

In [ ]:
# Training the model
if os.path.exists('./models/full_model_trained.pt'):
    if torch.backends.mps.is_available():
        model = torch.load('./models/full_model_trained.pt', map_location ='mps')
    elif torch.cuda.is_available():
        model = torch.load('./models/full_model_trained.pt')
    else:
        model = torch.load('./models/full_model_trained.pt', map_location='cpu')
else:
    torch.set_float32_matmul_precision('medium')
    model.train()

    train_loader = DataLoader(train_ds, batch_size=8, collate_fn=collator, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=8, collate_fn=collator, shuffle=True)
    
    pl.seed_everything(42)
    
    classifier = Classifier(model, lr=2e-5)
    
    if torch.cuda.is_available():
        trainer = pl.Trainer(accelerator='cuda', devices=1, precision='bf16-mixed', max_epochs=10)
    elif torch.backends.mps.is_available():
        trainer = pl.Trainer(accelerator='mps', devices=1, precision='bf16-mixed', max_epochs=10)
    else:
        trainer = pl.Trainer(accelerator='cpu', devices=1, precision='bf16-mixed', max_epochs=10)
    
    trainer.fit(classifier, train_loader, val_loader)
    torch.save(model,'./models/full_model_trained.pt')
In [ ]:
# Evaluating the model
check_accuracy(test_loader, model, device)
Test Accuracy of the model: 93.63

Run data for frozen model with open head
Surprisingly, the accuracy did not improve on this run (in light blue). This might be the result of several factors, from reaching the limits of this model to overfitting. We can see the loss increasing during these 10 epochs and the accuracy behaving erratically. Once again, these 10 epochs clearly required more time than the previous run due to the larger number of trainable parameters.

Model 2: Training all the layers at once
We will now fine-tune the model leaving all the layers unfrozen from the very beginning. We perform this training for 60 epochs.

In [ ]:
# Building or Loading the model
if os.path.exists('./models/all_layers_at_once_model.pt'):
    if torch.backends.mps.is_available():
        model2 = torch.load('./models/all_layers_at_once_model.pt', map_location ='mps')
    elif torch.cuda.is_available():
        model2 = torch.load('./models/all_layers_at_once_model.pt')
    else:
        model2 = torch.load('./models/all_layers_at_once_model.pt', map_location ='cpu')
else:
    model2 = ViTForImageClassification.from_pretrained(
        'google/vit-base-patch16-224-in21k',
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label
    )

model2 = model2.to(device)
In [ ]:
# Evaluating the model
check_accuracy(test_loader, model2, device)
Test Accuracy of the model: 3.84

As expected, the accuracy before training is very poor.

In [ ]:
train_loader = DataLoader(train_ds, batch_size=8, collate_fn=collator, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=8, collate_fn=collator, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=8, collate_fn=collator, shuffle=True)
# Training the model
if not os.path.exists('./models/all_layers_at_once_model.pt'):
    pl.seed_everything(42)

    classifier = Classifier(model2, lr=2e-5)

    if torch.cuda.is_available():
        trainer = pl.Trainer(accelerator='cuda', devices=1, precision='bf16-mixed', max_epochs=60)
    elif torch.backends.mps.is_available():
        trainer = pl.Trainer(accelerator='mps', devices=1, precision='bf16-mixed', max_epochs=50)
    else:
        trainer = pl.Trainer(accelerator='cpu', devices=1, precision='bf16-mixed', max_epochs=50)

    trainer.fit(classifier, train_loader, val_loader)
    torch.save(model2,'./models/all_layers_at_once_model.pt')
Global seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3050 Ti Laptop GPU') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type                      | Params
------------------------------------------------------
0 | model   | ViTForImageClassification | 85.8 M
1 | val_acc | MulticlassAccuracy        | 0     
------------------------------------------------------
85.8 M    Trainable params
0         Non-trainable params
85.8 M    Total params
343.247   Total estimated model params size (MB)
Epoch 59: 100%|██████████| 2454/2454 [08:12<00:00,  4.99it/s, v_num=17, val_acc=0.928]
`Trainer.fit` stopped: `max_epochs=60` reached.
Epoch 59: 100%|██████████| 2454/2454 [08:15<00:00,  4.95it/s, v_num=17, val_acc=0.928]
In [ ]:
# Evaluating the model
check_accuracy(test_loader, model2, device)
Test Accuracy of the model: 93.84

Unexpectedly, the last run with all layers unfrozen from the start reached very good accuracy and loss from the very first epoch.
Run data for frozen model with open head
We notice that the validation loss and accuracy do not improve after the first epoch and instead tend to degrade, which could be a sign of overfitting. It is also clear that this training run was more computationally expensive and required much more time to complete, yet within the first few epochs it already achieved results comparable to the progressive fine-tuning performed on our first model.
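Since the validation loss degrades after the first epoch, early stopping would be a natural guard. Below is a minimal pure-Python sketch of the logic (callbacks such as pytorch_lightning's EarlyStopping implement the same idea, monitoring the `val_loss` key logged in validation_step); the `EarlyStopper` class and the loss values are illustrative, not taken from these runs:

```python
class EarlyStopper:
    """Minimal early-stopping logic: stop after `patience` consecutive
    epochs without improvement in the monitored validation loss."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=2)
losses = [0.9, 0.5, 0.6, 0.7, 0.4]  # hypothetical per-epoch validation losses
stop_at = next(i for i, l in enumerate(losses) if stopper.step(l))
print(stop_at)  # stops at epoch 3: losses 0.6 and 0.7 never beat 0.5
```

In Lightning this amounts to passing `callbacks=[EarlyStopping(monitor="val_loss", mode="min", patience=3)]` to the Trainer, which would have cut most of the 60 epochs here.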

Predicting the class of a vehicle image¶

In [ ]:
# Creating a prediction function to predict the class of the image
def prediction(img_path):
    im = Image.open(img_path)
    encoding = feature_extractor(images=im, return_tensors="pt")
    pixel_values = encoding['pixel_values'].to(device)

    with torch.no_grad():
        outputs = model(pixel_values)
    result = outputs.logits.softmax(1).argmax(1)
    return id2label[str(result.item())]
In [ ]:
# Creating a process_image function to be processed before it can be shown
def process_image(image_path):
    pil_image = Image.open(image_path)

    if pil_image.size[0] > pil_image.size[1]:
        pil_image.thumbnail((5000, 256))
    else:
        pil_image.thumbnail((256, 5000))
    
    left_margin = (pil_image.width-224)/2
    bottom_margin = (pil_image.height-224)/2
    right_margin = left_margin + 224
    top_margin = bottom_margin + 224
    pil_image = pil_image.crop((left_margin, bottom_margin, 
                                right_margin, top_margin))
    np_image = np.array(pil_image)/255

    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    np_image = (np_image - mean) / std
    np_image = np_image.transpose((2, 0, 1))
    
    return np_image
In [ ]:
# Creating a imshow function to show the image
def imshow(image, ax=None, title=None):
    if ax is None:
        fig, ax = plt.subplots()
        
    image = image.transpose((1, 2, 0))

    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    image = std * image + mean

    if title is not None:
        ax.set_title(title)
    image = np.clip(image, 0, 1)
    
    ax.imshow(image)

    return ax
In [ ]:
# Creating a display_image function to display the image along with its title, and predicted vehicle type 
def display_image(image_dir):
   
    plt.figure(figsize = (10,10))
    plot_1 = plt.subplot(2,1,1)

    image = process_image(image_dir)

    pred= prediction(image_dir)

    plot_1.set_xlabel("The predicted vehicle type: "+pred)
    imshow(image, plot_1, title="Test image");
In [ ]:
# Path to the single image which to be tested
image_path1 = 'test.jpg'
In [ ]:
# Call the display_image function on the test image
display_image(image_path1)

As shown, the vehicle type was correctly predicted, i.e. Bicycle.

Conclusions¶

We are surprised by these results, as we expected the training of the second model to yield worse results overall and definitely did not imagine it would reach over 90% accuracy in one epoch. The only explanation that comes to mind is that the dataset is big enough to allow a full training of the model's parameters in one epoch. Furthermore, we had to reduce the batch size in order to perform this training within the limits of our GPU, which means more backpropagation steps were performed within one epoch compared to the previous model.
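The batch-size point can be made concrete: the number of optimizer updates per epoch scales inversely with batch size. A quick sketch (the sample count below is illustrative, not the actual dataset size):

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Optimizer updates per epoch with drop_last=False."""
    return math.ceil(num_samples / batch_size)

# Hypothetical 10,000 training images: batch 256 vs batch 8
n = 10_000
print(steps_per_epoch(n, 256), steps_per_epoch(n, 8))  # 40 vs 1250 updates/epoch
```

So a single epoch at batch size 8 performs roughly 32x more gradient updates than one at batch size 256, which is consistent with the second model improving so much within its first epoch.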

Reference¶

https://github.com/nateraw/huggingpics/tree/main: Used the code from this source for processing the image

https://huggingface.co/blog/fine-tune-vit: Used the code from this source to know how to initialize and use the collator and dataloader

https://medium.com/@kenjiteezhen/image-classification-using-huggingface-vit-261888bfa19f: Used the code from this source to know how to use the trained model to predict the class of an image.